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Abstract 

We propose a flexible and identifiable version of the two-groups model, moti- 
vated by hierarchical Bayes considerations, that features an empirical null and a 
semiparametric mixture model for the non-null cases. We use a computationally 
efficient predictive recursion marginal likelihood procedure to estimate the model 
parameters, even the nonparametric mixing distribution. This leads to a nonpara- 
metric empirical Bayes testing procedure, which we call PRtest, based on thresh- 
olding the estimated local false discovery rates. Simulations and real-data examples 
demonstrate that, compared to existing approaches, PRtest's careful handling of 
the non-null density can give a much better fit in the tails of the mixture distribution 
which, in turn, can lead to more realistic conclusions. 

Keywords and phrases: Dirichlet process; marginal likelihood; mixture model; 
predictive recursion; two-groups model. 



1 Introduction 



Large-scale multiple testing problems arise in many appli ed fields such as genomic s 
( iDudoit and van der Laanl 120081: ISchafer and Strimmerll2005l ). proteomic s (IGhoshl 20091) . 
astrophysics flLiang et al.ll2004l : MUler et all bOoU T and image analysis (jLindquistl 120081 



Schwartzman et al. 20081 ). to name a few. An abstract representation of the probl 
testing a set of hypotheses 



em is 



H 0i : the i th case manifests a "null" behavior, 



n 
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based on summary test statistics, or z-scores, Z\, . . . , Z n . The null behavior of a single 
z-score Zj can be described by the N(0, 1) distribution when is defined as the Gaussian 
transform of a test statistic derived for the i th case, such as the two sample t-statistic 
comparing treatment to control. Although this characterization leads to a simple re- 
jection rule for the i th case in isolation, it is found insufficient when all n tests in are 
to be performed, particularly when n is very large. In fact, one of the major develop- 
ments of modern statistics has been the philosophical shif t from treating the z-score s 
as mutually independent to treating them as exchangeable (lEfron and Tibshiranill2002l ). 
Consequently, recent work on large-scale simultaneous testing has focused on Bayesian 
models and, in particular, empirical Bayes methods that allow for information sharing 
between cases, even though separate decisions will be made for each case. 

An elegant formalization of the larg e-scale simultaneous testing problem is the two- 
groups model f lEfronl I2004L 120071 120081 ) which assumes Z\ y . . . , Z n arise from a mixture 
density 

f(z) = nf (z) + (l-n)f 1 (z), (1) 

with fn and ft , res pectively, describing the null and non-null distributions of the z-scores. 
Efron fl2004j . 120081 ) argues that, for a variety of reasons, the case-specific theoretical null 
distribution N(0, 1) may not be an adequate choice for f , and a more appropriate choice 
is the so-called empirical null distribution N(/i, a 2 ), where \i and a are to be estimated 
from data. 

Following Efron's original treatment, various new methods have been proposed for 
fitting and drawing inference from the two groups model of z-scores (IJin and Cail 120071 ; 
Murafidharanl 120101 ). These methods, together with related metho dology based on p- 



values or t-scores (e.g., iBenjamini and Hochbergi Il995l ; IStorevI 120031 ). have been widely 



used in biological studies with high-throughput data, in particular to identify genes re- 
sponsible for a phenotypical behavior based on microarray analysis. The single- summary- 
per-case approach of these methods offers substantial computational advantage over other 
approaches to analyze such data, such as thos e based on high- dimensional classification 



techniques (IGolub et alJll999t iLee et al.ll2003l ) 



However, currently available methods for fitting (TjQ) do not take full advantage of the 
two-groups formulation. Motivated by applications to microarray studies, where typically 
a very small fraction of genes are linked with the phenotype, existing two-groups methods 
take a conservative approach of encouraging estimates of it close to 1. While this is 
reasonable for many applications, there are scientific studies where such a conservative 
approach fails to detect any or a majority of t he interesting cases. Figure [1] reports two 
such mic roarray studies, a leuke mia study by iGolub et al.l (119991 ) and a breast cancer 
study by iHedenfalk et al.l (1200 ll ); more details are given in Section EJ As shown in the 
figure, existing methods each produce estimates of the null component irfo that cover 
one or both tails of the z-score histogram, leaving little to be explained by the non-null 
component (1 — 7r)/i. Consequently, zero discoveries of interesting genes are made in one 
or both tails ; see Table B] in Section El High-dimension al classification-based analyses 
(IGolub et al.lll999l : IHedenfalk et al.llioOlt ILee et al.ll2003h . on the other hand, identify a 
number of interesting genes on either tail for each of the two studies. 

In this paper we consider a new likelihood-based analysis of the two-groups model, 
with a regularization on /i, a, ir and a semiparametric specification of the non-null density 
fi. We employ a mixture representation of f\ that gives it heavier tails than f to reflect 
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(a) Leukemia z-scores 



(b) Breast cancer z-scores 



Figure 1: Density histogram of z-scores from leuke mia microarray data (IGolub et al 
1999 ) and breast c ancer data ( Hedenfalk et al. 200 ll ) and e stimates of irfn based on the 
methods of lEfronl fl2004h (— ). |jin and Cail ll200'ft \T--) and iMuralidharanl (bold ) (••■)■ 



the belief that z-scores from the non-null cases are likely to be larger in magnitude than 
those from the null cases. The null weight ir is given a beta prior with a center close to 
one but with a relatively long left tail. Additionally, we use a prior on (/i, er) to reflect 
the belief that this vector is likely to be close to (0, 1). 

Compared to the existing methods based on z-scores, our proposal allows a wider 
range of estimates of ir. For scientific studies where the existing methods discover a 
fair number of interesting cases, our method makes similar discoveries. But for other 
studies where existing methods fail, such as the two studies mentioned earlier, our method 
makes discoveries that are comparable to those found via high-dimensional classification 
methods. A similar adaptability property manifests in a simulation study where z-scores 
are generated according to (pP) with 7r ranging between 0.75 to 0.99; see Table [1] in 
Section [5j 

Despite a nonparametric specification of f\ and a likelihood-based analysis, our treat- 
ment of the two-groups model retains the computational efficiency that is hallmark of 
methods based on z-scores. This has bee n possible due to recent developments on a 
stochastic algorithm due to iNewtonl (120021 ) called predictive recursion, for estimation of 



mixing densities with respect to any arbitrary dominating measure; se e alsolNewton et al. 
(1998). Theoretical properties of this algorithm are addr essed in iGhosh and Tokdar 



(I2006f). iMartin and Ghosh! (12008! ). iTokdar et all f l2009h . and lMartin and Tokdar! tj2009h . 



Martin and Tokdar! ( 120111 ) show how this algorithm can be used in a hierarchical mixture 



model to construct a likelihood function over non-mixing model parameters, marginalized 
over the mixing density. This marginal likelihood is shown to have strong connections 
to the marginal likelihood under a Bayesian Dirichlet process mixture model. We adopt 
this marginal likelihood calculation to the two groups model, with /i, a, ir and a scaling 
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parameter in the specification of f\ serving as the non-mixing parameters. 

For the multiple testing problem, we adopt the strategy of mimicking the Bayes oracle 
rule by thre sh olding a plug-in est imate of the local fa l se disc overy rate, similar to lEfron 



(12004 120081 ). iJin and Cail (120071 ). and iMuralidharanl ( I2010f ). Simulations presented in 



Section O show that the proposed method, called PRtest, is more adaptive to asymmetry 
in the non-null density f\ and to the degree of sparsity characterized by 7r. Perform a nce o f 
PRtest in an interesting example using the artificial microarray data of lChoe et al.l ( 120051 ) 
is addressed in Section [6j In this example, the set of interesting genes is known and we find 
that PRtest performs considerably better than existing methods and strikingly similar 
to the oracle. Likewise, for the leukemia and hereditary breast cancer studies, we find 
that the PR-based estimation produces a better fit in the tails of the distribution than 
that seen in Figure Q] and, consequently, we are able to identify a number of interesting 
genes in each example. The identified genes are, in fact, consistent with those identified 
by more sophisticated high-dimensional classification-based techniques. 



2 Model specification 

We take fo(z) = H(z\ n, a 2 ), the normal density with unknown mean and variance [i and 
a 2 . The non-null density f± is taken to be a semiparametric mixture of the form 

fi{z) = / N(z | fi + Tau,a 2 )tp(u)du, (2) 

with ip a density with respect to the Lebesgue measure on % = [—1,1] and r > 1 a 
scaling factor. An important consequence of the requirement that ip be a density is given 
in the following theorem; see Appendix A for a proof. 

Theorem 1. For f and f\ as described above, the parameters (fi, a, n, r, ip) in our 
version of the two-groups model are identifiable. 

This result is useful because, in general, identifiability is not guaranteed for a two 
groups model ([1]) with an empirical null that involves unknown parameters. For our 
specification, the key to identifiability is the model feature that fi, by virtue of averaging 
over locations shifts of fo, has heavier tails than / - This feature is scientifically relevant 
as it embeds the belief that z-scores in the tails of the histogram are more likely to 



correspond to non-null cases than null. lEfronl (120081 ) incorporates a similar belief through 



a zero- assumption: most z-scores near zero are from the null component. However, such a 
zero-assumption can be too strong to allow learning from data and can lead to an estimate 
of 7r/o that has heavier tails than any reasonable histogram-smoothing estimate of /, as 
reported by IStrimmerl ( 120081 ) and illustrated in Figure [TJ In comparison, separating fo 



and fi by their tails seems more practical; see Section [6j 

An important feature of the model is that fo and the kernel being mixed in ([2]) have a 
common scale. This, along with the assumption that ip is a density, are the driving forces 
behind the characteristic function-based proof of Theorem [Q not the distributional form 
of these densities. In fact, the proof in Appendix A applies when fo(z) = o" -1 p((z — fj)/o) 
for a suitably smooth density p symmetric about zero, and f\ is likewise determined by 
p. Here we focus only on the case where p is the N(0, 1) density, but other choices like a 
Student-t density with fixed degrees of freedom can also be entertained. 
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3 Mixture models and predictive recursion 



It is more convenient to write our specification of / as the mixture model 



/(*) 



p(z | 9, u) ^(du) 



(3) 



with parameters 9 = (/x, a, r), kernel p(z \ 9,u) = N(z \ fi + rau,a 2 ), and mixing 
probability measure ^ on % that assigns a positive mass tc at £ and distributes the 
remaining mass on according to a Lebesgue density ip. The collection of all such \I/ is 
the set P = P(^, v) of probability measures that are absolutely continuous with respect 
to the measure v defined as the sum of the Lebesgue measure on % and a point mass at 
0. The //-density of such an \1/ will be denoted by 7r(0) + (1 — Tr)ip. 

Inference on (9, can be performed in a Bayesian setting with a prior distribution on 
(9, \[>). A popular choice of prior distribution fo r the nonparametric probability measure 
\1/ is the Dirichlet process prior ( IFergusonl Il973l ) . However, there are two practical diffi- 
culties in employing this inference framework for our model. First, the Dirichlet process 
prior entertains only discrete probability measures, thus violating the important absolute 
continuity property of ^ with respect to v. Second, despite recent advances in computing, 
fitting a Dirichlet process mixture model does not scale well with the number of observa- 
tions n. For microarray studies, n ranges from thousands to tens of thousands, whereas 
for more recent single nucleotide polymorphism studies, n can reach several hundreds of 
thousands. For such massive data sets, fitting a Dirichlet process mixture model can be 
fairly time consuming, nullifying some of the advantages of the two-groups framework. 
As an alternative, we estimate (0, \EQ via the predictive recursion (PR) methodology 



( IMartin and Tokdarl 1201 It iNewtonl |2002| ). PR is a stochastic algorithm for estimating 



a mixing distribution \1/ in ([3]) through fast, recursive updates that have a strong con- 
nection with posterior updates for Dirichlet process mixture models. The algorithm ac- 
commodates user-specified absolute continuity constraints on the mixing distribution and 
enjoys attractive convergence properties under mild conditions with allowance for model 
misspecification (iGhosh and Tokdarl 120061 ; iMartin and Ghosh! 120081 ; iMartin and Tokdar 
20091 ; iTokdar et al.ll2009l ). However, Newton's original proposal can estimate the mixing 
distribution only when the kernel being mixed is known exactly, i.e., for (151), an estimate 
of_JMs available only when 9 is known. To resolve this difficulty, Martin and Tokdar 



( 1201 ll ) introduce a "marginal likelihood" function for non-mixing parameters 9 based on 
the output of the predictive recursion. 

PR Algorithm. Start with an initial estimate \I/o with ^-density 7r (0) + (1 — ir)ipo and 
a sequence of weights Wi, . . . , w n £ (0, 1). For i — 1, . . . , n compute 



fi-\,e{Zi) 



p(Zi | 9,u)^f i - 1 (du), 
1 - Wij^i-^du) + Wip(Zi | 6',it)*i_i(dit)// i _i i9 (Z i ). 



(4) 



Produce \l/ n as an estimate of \1/ and L n (9) = YYi=i fi-i,o(^i) as a marginal likelihood 
function for 9. 



Martin and Tokdarl (120111 ) point out several justifications for labeling L n {9) as a like- 



lihood function of 9. For n = 1, £i(#) equals the marginal likelihood function of 9, 
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integrating out \& under the Bayesian specification ^ ~ DP(ct, ^o); the Dirichlet process 
distribution with precision a = (1 — W\)/w\ and base measure \l/o- For n > 1, this cor- 
respondence is not exact, but L n (9) can be viewed as a filtering approximation of the 
corresponding Dirichlet process marginal likelihood function. Additionally, L n (6) features 
an asymptotic concentration property commonly e njoyed by li kelihood functions for in- 
dependent and identically distributed data models (jWaldlll949l ). Specifically, for large n, 
with Z±, . . . , Z n independently drawn from a common density /*, log L n {9) ~ —nK*(9), 
where K*(6) equals the minimum Kullback-Leibler divergence between /* and densities 
/ of the form ([3]) with \1/ ranging over the set P and all its weak limits points. 



4 Regularization and PRtest 

We employ a regularized version of the predictive recursion methodology to estimate 
(6, for our two groups model. The regularization is motivated by a hierarchical Bayes 
formulation of with \1/ ~ DP (a, \Po) where hyper-prior distributions are specified on 
the model parameters //, a, r and ^o- We take the i/-density of \I/o to be 7r (0) + (1 —ttq)^ 
with a fixed choice of ipo(u) oc u 2 . Among the remaining parameters, a G (0,oo), r G 
(1, oo ) and 7r G (0, 1) are taken to be independent with logo" ~ N(0, 0.25 2 ), log(r — 1) ~ 
N(0,1) and 7r ~ Beta(22.7, 1). Given a and the other parameters, /i is assigned the 
conditional prior distribution N(0,o- 2 /400). 

In our experience, a in the range [0.5, 2.0] is typical, and the log-normal prior puts 
nearly all of its mass there. Other priors for a may also be considered, such as a conjugate 
scaled inverse-chi distribution. The restriction r > 1 ensures that the non-null density 
fx is considerably wider than f Q , and the normal prior for log(r — 1) su pports a large 



set of values in this range. The 22.7 in the beta prior for 7r , also used by iBogdan et al. 



( 120081 ). assigns about 90% of its mass to the interval [0.9,1], reflecting the belief that 
the null proportion ir is likely to be large. Finally, the prior for n is scaled to the choice 
of o and highly concentrated around the origin, reflecting the belief that the z-scores 
should have mean close to zero. Finer tuning of this default prior for specific problems is 
straightforward. 

For a predictive recursion analog of this hierarchical Bayesian model, we interpret 
the predictive recursion likelihood as a function of both 9 = (/i, a, r) and ttq. Writing 
this likelihood as L n (fi, a, r, 7r ) and letting g({i,a,T,n ) denote the joint prior density 
function on these parameters, a regularized version of the predictive recursion marginal 
log-likelihood function can be written as 

I n (li, a, r, 7T ) = logL n (/i, cr, r, vr ) + log#(/i, a, r, vr ). (5) 

Estimates of these parameters are obtained by maximizing £ n = £ n (/i, a, r, n ). Once 
these estimates are obtained, predictive recursion is run one last time with the estimated 
values of these parameters to produce an estimate of F, i.e., of ir and of ip in ([I]) and 
02]), respectively. In our implementations, maximization of £ n is done by the gradient- 
based Broyden-Fletcher-Goldfarb-Shanno (BFGS) optimization method. In Appendix B 
we provide a variation on the PR algorithm that produces the gradient of logL n as a 
by-product. 
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The predictive recursion methodology depends on two additional factors, namely, the 
choice of weights w-i w„, and th e order in which the z-scores are processed by the al- 



gorithm. Martin and Tokdarl (120091 ) provide an upper bound on the rate of convergence 
for PR estimates of the mixture / when the weights are of the form Wi = (i + 1)~ 7 , 
7 G (2/3, 1]. Our choice u>j = (i + l) -0 - 67 is close to the limit 7 = 2/3 where the upper 
bound is optimal. The recursive nature of the algorithm induces dependence on the order 
in which the Zi values are visited. We reduce this dependence by replacing £ n with its 
average over a number of random permutations of the data sequence. Averaging over 
permutatio ns increases the ove rall computation time, but adds stability to parameter 
estimation ( jTokdar et al.l 120091 ). In our experience, averaging over 10 random permuta- 
tions is sufficient to stabilize the estimates of 9, and the additional computation time 
required is negligible. To reduce variability due to random permutation, we keep the set 
of permutations fixed over the process of maximizing £ n . 



For multiple testing, we consider the local false discovery rate ( lEfronl 120041 ). given by 



fdr(z) = 7Tf (z)/f(z) 



which represents the posterior probability that a case with z-score Z = z is null. ISun and Cai 
( 120071 ) argue that the local false discovery rate is the fundamental quantity for multiple 
testing. Once regularized PR estimation of (//, a, r, ir, i/j) is completed, a plug-in estimate 
fdr of fdr is readily available, and PRtest is implemented by thresholding fdr; that is, we 
declare case % as non-null if fdr(Zj) < r for some specified threshold r G (0, 1). According 
to Efron, this multiple testing rule will control the Benjamini-Hochberg f alse discovery 
rate a t level r. In our examples we take r = 0.1. This choice, use d bv ISun and Cai 
( 2007). is somew hat subjective, but sits bet ween the choice r = 0.2 of lEfronl (120081 ) and 
Strimmerl (120081 ) and the choice r = 0.05 of I Jin and Cail (120071 ) and others. 



5 Simulations 



Here we investigate the performance of PRtest in several large-scale simulations where 
we can compare the results with the benchmark Bayes oracle te st. The res ults will also 



be compared to those obtained from t he Fo urier-based method of iJin and Cail (120071 ) and 
the mixfdr method of iMuralidharanl (j2010l ). 

For Zi, . . . , Z n , we assume independence and take the null density as fo(z) = U(z \ 
fi, a 2 ). Here we fix n = 1000, /i = 0, and a = 1. Four choices of fi are considered: 

CI: fi(z) — N(z I 0, a 2 + u 2 ). Taking u 2 = 13 ~ 2a 2 log n ensures the non-null z-scores 



are "detectable" ( iDonoho and Johnstone! Il994l ). But, in our experience, the range 
of z-scores one finds in real data analysis is consistent with smaller signals, so we 
take u 2 = 4. 



C2: U(z) = 0.5 CN (z I u, a 2 ) du. This choice, used by IMuralidharanl (120101 ) and 



Johnstone and Silverman! (12004! ). exhibits asymmetry and has only slightly heavier 



tails than the null. 



C3: h(z) = 0.67 N( 



-3,2) + 0.33 M(z I 3,2). This one is asymmetric and a large 



portion of its mass is concentrated away from the origin. 
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C4: fi(z) = 0.25 Jj_ 4 _ 2 i u r 2 41 | u, a 2 ) du. This is a symmetrized version of C2. A key 
feature of this choice is that the unobserved signals are bounded away from zero. 

For each of the four choices of j\, we consider six choices of tt ranging from 0.75 to 0.99, 
forming a total of 24 simulations settings. Each setting is replicated 500 times and the 
results are reported below. Our implementation of PR uses weights Wi = (i + l) -0 - 67 and 
the regularized likelihood l n is averaged over 10 permutations of the data sequence. 

Table [T] summarizes the estimates of the null parameters tt for each simulation setting. 
Estimates of (fi, a) are similarly accurate across methods, models, and sparsity, so these 
results are omitted. From the table we find that the maximum PR marginal likelihood 
estimates are the most adaptive across the range of tt values, specifically for choices C2 
C4. Of particular interest is PRtest's strong performance in the two most practically 
realistic cases, namely C3 and C4, which have smooth non-null densities with modes on 
both the left and right side of zero. Also the average computation time for PRtest is 
roughly 3 seconds, which compares favorably with that for Jin-Cai (0.7 seconds) and 
mixfdr (0.5 seconds). 

Next we compare the performance of the selected methods based on false non-discovery 
rate, false discovery rate, power, and Bayes risk. We limit this discussion to non-null 
choice C3; the results for the other models are similar. Figure [2] plots these quantities as 
functions of tt for the selected methods and the Bayes oracle procedure; the Bayes oracle 
is the rule based on thresholding the true fdr at level 0.1. The general message is that 
PRtest is competitive with the other tests in all aspects across a range of sparsity levels. 
In particular, the four tests are similar in terms of false non-discovery rate, particularly 
for large tt, but PRtest is better than mixfdr and Jin-Cai for relatively small tt. Also, 
each of the four tests have relatively small false discovery rates, although the Jin-Cai 
method has a somewhat unexpected spike, which explains its higher power for large tt 
values. Theoretically, the Bayes oracle test has the smallest Bayes risk uniformly over tt, 
but the PRtest risk sits very close over the entire range of tt. This obser v ation suggests 



that PRtest may be asymptotically optimal in the sense of iBogdan et all (120111 ) 



6 Examples 

6.1 Validation with spike- in data 



An interesting spike-in dataset was built by IChoe et al.l (120051 ). The dataset itself is 
artificial — so the set of interesting genes is known — but their careful construction gives it 
some features of a real control-versus-treatment microarray study. We consider a subset 
of this data (available in the R package st) consisting of 11,475 genes, of which 1,331 
are differentially expressed. Z-scores are obtained by taking a Gaussian transform of 
the standard two-sample t-test statistics. Figure E^a) shows histogram of the observed z- 
scores, along with the PRtest fit of the two-groups mixture model. The estimated density 
clearly fits the data very well, and the fdr thresholding method flags 235 genes as down- 
regulated. For comparison, Figure [3](b) reports an oracle fit of the two-groups model, 
where tt is estimated as the known proportion of differentially expressed genes, (/i, a) are 
estimated by maximum likelihood based on the null z-scores, and fx is estimated by a 
standard Gaussian kernel estimate based on the non-null z-scores; Table [2] reports the 
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h 



7T 



Jin-Cai 



mixfdr 



PRtest 



CI 



0.75 
0.80 
0.85 
0.90 
0.95 
0.99 



0.928 
0.929 
0.934 
0.945 
0.961 
0.978 



0.019 
0.019 
0.018 
0.015 
0.011 
0.005 



0.957 
0.965 
0.971 
0.980 
0.989 
0.995 



0.009 
0.007 
0.006 
0.005 
0.003 
0.001 



0.918 
0.930 
0.942 
0.960 
0.980 
0.995 



0.017) 
0.016) 
0.014) 
0.014) 
0.010) 
0.003) 



C2 



0.75 
0.80 
0.85 
0.90 
0.95 
0.99 



0.905 
0.874 
0.860 
0.869 
0.926 
0.984 



0.015 
0.019 
0.023 
0.028 
0.017 
0.007 



0.827 
0.860 
0.894 
0.927 
0.962 
0.991 



0.016 
0.012 
0.009 
0.007 
0.005 
0.003 



0.761 
0.804 
0.851 
0.896 
0.940 
0.980 



0.017) 
0.014) 
0.013) 
0.010) 
0.009) 
0.008) 



C3 



0.75 
0.80 
0.85 
0.90 
0.95 
0.99 



0.909 
0.886 
0.871 
0.886 
0.935 
0.980 



0.013 
0.015 
0.021 
0.020 
0.012 
0.004 



0.857 
0.881 
0.909 
0.937 
0.967 
0.991 



0.017 
0.013 
0.011 
0.008 
0.005 
0.003 



0.788 
0.828 
0.867 
0.903 
0.937 
0.982 



0.016) 
0.015) 
0.014) 
0.014) 
0.013) 
0.010) 



C4 



0.75 
0.80 
0.85 
0.90 
0.95 
0.99 



0.951 
0.934 
0.920 
0.908 
0.929 
0.980 



0.007 
0.010 
0.015 
0.025 
0.017 
0.007 



0.886 
0.897 
0.920 
0.948 
0.975 
0.995 



0.035 
0.015 
0.010 
0.007 
0.005 
0.002 



0.784 
0.814 
0.862 
0.901 
0.943 
0.992 



0.066) 
0.021) 
0.018) 
0.013) 
0.012) 
0.005) 



Table 1: Mean ( standard deviation) of the 500 estimates of tt for the method of 



Jin and Cail ( 120071 ). the mixfdr method of iMuralidharanl (120101 ). and PRtest for the four 



alternative densities (/i's) described in Section [5J 



9 



FNR 



FDR 




cc 

D 



o _| X 
d 



° j- 



A 




— * 



1 1 1 1 1 

0.75 0.80 0.85 0.90 0.95 



d 



CO 

d 



C\J 

d 



"T 



Power 









x ■ - . 


^0 


" A 


•+••.. 


••+.., 


o 



T" 




T" 



CC 



71 

Risk 




0.75 0.80 0.85 0.90 0.95 



0.75 0.80 0.85 0.90 0.95 



Figure 2: Plots of the false non-discovery rate (FNR, top left), false discovery rate (FDR, 
top right), power (bottom left), and Bayes risk (bottom right) against n for the selected 
testing procedures in the C3 simulation setting described in Section [5j 
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z z 

(a) PRtest fit (b) Oracle fit 

Figure 3: Histogram of the z-scores for the golden-spike data in Section l6TT| along with 
fits of the two-groups model using (a) PRtest and (b) the Oracle described in the text. In 
each plot, overlays are 7r/ (solid black line), (1 — n)fi (dashed black line), and / (solid 
gray line). The estimated fdr and the 0.1 threshold are shown on the negative scale. 
Numerical values on the top left and right indicate the number of genes flagged as down- 
and up-regulated, respectively, by the fdr thresholding rule. 



parameter estimates. This oracle procedure is, in some sense, the best fdr thresholding 
procedure one can hope for, and it flags 249 genes as down-regulated. 

For further comparison, we applied the methods of Efron, Jin and Cai, and Muralid- 
haran and the results are summarized in the top panel of Table [2j PRtest and the oracle 
perform similarly in every respect, while the other methods are substantially different. 
Only the Jin-Cai method is able to pick out a reasonable set of interesting genes, a 
bit larger than the sets identified by the oracle and PRtest. However, these additional 
discoveries result in a 50% increase in false discovery rate. 



6.2 Application to real data 



We applied PRtest, along with the methods of Efron, Jin and Cai, and Muralidharan 
to the t wo microarray gene expression datasets mentioned in Section [p the leukemia 



study by iGolub et all (Il999l ) and the hereditary breast cancer study by iHedenfalk et al. 
(120011 ). The parameter estimates and gene classifications are summarized in Table [3j 



In both datasets, PRtest estimates 7r to be relatively small and identifies a number of 
interesting genes, while the others identify none " see Figure HI P Rtest's findings in these 
two datasets are corroborated by the results of iLee et al.l (120031 ) who learn a treatment 
classifier from gene expression levels and validate it by accurately classifying samples 
from an independent test set. That is, t he set of interest ing; genes identified by PRtest 
substantially overlaps with the set of genes ILee et al.l (120031 ) flag as important constituents 
of their classifier; these are also displayed in Figure [U F or the breast cancer study, 



some of the genes identified by PRtest and ILee et al.l (120031 ). such as keratin 8, TOB 1 
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Number of genes 



Method 




a 


7T 


Left 


Right 


FDR 


FNR 


Efron 


0.33 


1.50 


0.99 


2 





0% 


12% 


Jin-Cai 


0.77 


1.45 


0.91 


306 





3% 


9% 


mixfdr 


0.28 


1.45 


0.97 


8 





0% 


12% 


PRtest 


0.42 


1.34 


0.88 


235 





2% 


10% 


Oracle 


0.30 


1.31 


0.88 


249 





2% 


10% 



Table 2: Results for the spike-in dataset considered in Section 16.11 FDR and FNR 
denote the false discovery and false non-discovery rates, respectively. Also, the "Oracle" 
method, as described in the text, uses the information about which genes are differentially 
expressed to estimate fdr. 



Number of genes 



Data 


Method 




o 


71 


Left 


Right 


Leukemia 


Efron 


0.57 


1.18 


0.88 


276 







Jin-Cai 


0.95 


1.30 


0.91 


291 







mixfdr 


0.56 


1.35 


0.96 


71 







PRtest 


0.23 


1.04 


0.63 


333 


226 


BRCA 


Efron 


-0.33 


1.45 


1.00 










Jin-Cai 


-0.42 


1.44 


1.00 










mixfdr 


-0.31 


1.38 


0.99 










PRtest 


-0.01 


1.04 


0.45 


231 


11 



Table 3: Results for the two real microarray datasets considered in Section [62 



and phosp hofructokinase p latelet, have known biological connections to breast cancer 



mutations (ILee et al.l 120031 . p. 93). The fact that the gene expression levels lead to a 
well- validated classifier suggests that some genes must be differentially expressed. In this 
light, it is surprising that the methods of Efron, Jin and Cai, and Muralidharan fail to 
identify a single interesting gene. 



7 Discussion 

This paper provides a new and identifiable semiparametric formulation of the two-groups 
model and a computationally efficient algorithm to estimate the model parameters. This 
naturally leads to a nonparametric empirical Bayes multiple testing rule based on thresh- 
olding the estimated local false discovery rate. In simulations we find that PRtest is 
comparable to existing methods, including the Bayes oracle. What is particularly inter- 
esting is that the PRtest results differ substantially from those of existing methods in the 
examples of Section [6], and we argue that our findings are, in fact, more believable. 

We have chosen to focus only on the case where the null z-scores are normally dis- 
tributed, though the theory and methods presented here work for other well behaved 
parametric families. Normality of null z-scores is indeed a strong structural assump- 
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(a) Leukemia 



(b) Breast cancer 



Figure 4: PRtest's fit to z-scores values from leukemia and breast cancer microarray 
data. Overlaid on the z-score histogram are the estimates of irfo (solid, black), (1 — 7r)/i 
(dashed, black) and / = 7r/ + (1 — tt)/i (solid, gray). The estimated fdr curve is shown 
on the negative scale, with the cut-o f f of 0. 1 marked by the horizontal grey, dashed line. 
The 27 genes identified by iLee et al. fl2003h in each data set are marked with a blue bar 
at their z-score, with the height of the bar indicating the posterior probability of being 
included in the classification model. For the leukemia da ta, red dots a re pla ced on the 
z-scores of the 50 genes identified in the original study by lGolub et al.l (119991 ) . 
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tion, but identification of the null from the non-null requires strong parametric shape 
restrictions on one of the two components. Assuming a normal null component is natural 
because, theoretically, the null z-scores should have a standard normal distribution. This 
is similar to p-value-based methods where the null p-values are assumed to be uniform. 
A purely statistical verification of this kind of assumption seems quite challenging. One 
could possibly gain insight on this issue through biological experiments consisting entirely 
of null cases. 

We have justified the continuous location mixture formulation of f\ in fl2]) on two 
grounds: first, it makes the model parameters identifiable and, second, it conforms to 
the accepted notion that the alternative is more likely than the null to produce z-scores 
of large magnitude. This latter property is also satisfied by a discrete mixture fi = 
^^ =1 7TjN(/i + rauj,a 2 ), for which the identifiability condition does not hold. But with 
the regularization to encourage selection of fo centered near zero, and the ability of a 
flexible continuous mixture to approximate a discrete one, PRtest might still perform well 
in this difficult situation. Our limited simulations seem to indicate that this is true. The 
case where fi is not wider than f also yields a coherent statistical simulation model, but 
we argue that it corresponds to a biologically untenable abstraction. Indeed, the multiple 
testing framework accepts the z-scores as scores whose magnitudes (possibly after a small 
shift of origin) give an ordering of how interesting the cases are relative to each other. The 
question is to decide how interesting a case must be in order to be labeled as non-null. 
Accepting the relative ordering is equivalent to accepting that f\ must be wider than f . 



Software 

R software to implement the proposed PRtest methodology can be found at S. Tokdar's 



website, http : //www . stat . duke . edu/~ st 118/Sof tware[ 
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A Proof of Theorem 1 



Here we prove a more general version of Theorem 1 in the main text. Let p(z) be a 
probability density function on R, sym metric about zero. Furthermore, assume p is 
supersmooth in the sense of iFanl (|199ll ); see below. In the main text, we took p(z) 
to be a N(0, 1) kernel but, e.g., a Student-t kernel with known degrees of freedom would 
also satisfy these conditions. 
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For the particular choice of p, define the density-valued mapping 
M(ji,a,T,n,il>)(z) = 7rcr _1 p((z - ji)/a) 

'1 — 710 / a~ l p({z — p — tu) I ' a)fau) du. 



To prove that (p, a, t, ir, fa are identifiable, we need to show that M is a one-to-one 
function. Therefore, we start by assuming M(p\, o\, r±, tti, fa) = M(p 2 , &2, t 2 , vr 2 , fa). 
Let p*(t) and fait) denote the characteristic functions of p(z) and fa(z), respectively, for 
k = 1,2. Then we must have 



exp(zfyti)p*(<7it){7Ti + (1 - 7ri)V'i(o-iV r i)} 

= exp(UfX2)p*(cr 2 t){iT2 + (1 - 7r 2 y*/4(cx 2 t/T 2 )} 



(6) 



for e very t € R, where i is the imaginary unit. Recall that the Riemann-Lebesgue lemma 
(e.g.. iBillingsleyl Il995l Theorem 26.1) says, for k = 1,2, 



fa k (t) -i> as t ->■ ±00. 



(7) 



Now, suppose <7i > o"2 and assume, without loss of generality, that /j 2 > 0. Choose a 
sequence {t s } C IR such that t s — > 00 and exp(it s/ u 2 ) = 1. Then, for large enough s, (0) 
implies that 7r 2 + (1 — 7r 2 )fa(cr 2 t s /T 2 ) 7^ 0. On rearranging the terms in we get 



P 



*(cr 2 t s ) exp(^/ii){7Ti + (1 - Tr^^^cri^/ri)} 



J9*((Tlt s 



7T 2 



TT 2 )fa(a 2 t s /T 2 ) 



We have assumed that p is supersmooth (|Fan!ll99ll ). which means that 
d \tf° exp{-\tf < \p*(t)\ <di|t| ft exp{-|t|77>, 



(9) 



for all rj and for some positive constants do, d\, /3q, 0\, 0, and 7. Under this assumption, 
the modulus of the left-hand side of (13) satisfies 



p*(a 1 t s 



> const x It, 



\P1-P0 



exp{|t 



^)/7>- 



Therefore, as s — > 00, the left-hand side of flBJ) is unbounded while the right-hand side is 
bounded. This is a contradiction, so we need U\ < cr 2 . But by symmetry, it follows that 
<j\ = er 2 . With this equality, relation fl§]) easily leads to the equalities p\ — p 2 , Tl = t 2 , 
txi = 7r 2 and fa = fa, completing the proof. 



B Gradient of the log PR marginal likelihood 



This section provides a variation on the predictive recursion (PR) algorithm that yields 
the gradient of the log P R marginal likelihood function, based on the development in 
(IMartin and Tokdarll201ll ). The model under consideration here is the following: 



f(z) = nN(z I p,, a 2 ) + 7T / H{z I p + rau, a 2 )tp(u) du, 
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where ip is an unknown mixing density supported on [—1,1]. The details of the PRtest 
method can be found in the main text. Here we focus only on computing the gradient of 
= Y^i=\ 1°S fi-i,o{Zi), where fk,e(z) is the PR estimate of the mixture density based 
on Zi, . . . , Zjs and 9 = (/x, a, r, 7r ), slightly different than in the main text. 

Define an unconstrained version of 9, i.e., rj = (/i, log a, log (r — 1), logit ir ), where 
logitx = log(j^—). In what follows, V will denote a gradient with respect to rj, and if g is 
a function of a variable u, then Wg(u) denotes the gradient with respect to rj, pointwise 
in u. The following algorithm shows how to compute Aj = / i _i i 6)( r; )(Zj) and VlogAj for 
i = 1, . . . , n. 

1. Start with user-specified 7Tq and fo, and set 



2. For i = 1, . . . , n, repeat the following three steps: 

(a) For the normal kernel p(z \ 9, u) = H(z | fi + crru, er 2 ), set 

G = N(Zj | fi, a 2 ) and Gi(it) = N(Zj | /i + am, a 2 ), 

and analytically evaluate the gradients VGo and VGi(w): 

VG = (zo/<t, zl- 1,0,0) -G 
VGi(u) = (zi(u)/a, z\{u)ru/a + z\(u) — 1, z\{u)u{r — 1), 0) • Gi(u), 

where z = [Z^ — ji)/a and Z\{u) = (Zi — fi — am) /a. 

(b) Compute 



Vtt = (0,0,0, 7T (1 -7T )) and Wo(u) = (0, 0, 0, 0). 




7Tj_iGo + (1 - Iti-ljhi 



V log ^ 




VlogA, 



V7Ti_iG + 7ri_iVG + hi{(\ - 7ri_i)Vlog/7,i - V^-i} 



(c) Update 



Vtt, 



AoV7Ti_i + VA 7rj_i 

{V5Ai(u) + BVAi(u)}^i_i(«) + 5Ai(«)V^_i(«) 



where 



A = 1 + Wi(G Q /Xi - 1) 

= l + w i (G 1 (u)/A i -l) 
£ = (1 - ^.0/(1 - AoTTi-O 
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and 



^{VGo-GoVlogAJ/Ai 
^{VdH-GxHVlogA.j/A, 

(ff A) - l)V7T i _ 1 + BVApTtj-l 
1 - A 7Ti_i 

3. Return the log-likelihood X/*=il°S^» anc ^ ^ s gradient J^™ =1 VlogAj. 
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